Outstanding Objects (Developed Dirt Cheap)

Mark Leighton Fisher on 2003-06-05T20:30:33

Tom Christiansen once wrote about how useful regexes and hashes are, yet how rarely anyone ever writes their own -- the first time I read that, I thought "gee, I've written *both*!" (I wish I could remember where he said this...)

Why is this, I wonder? My first hash was around 1982-1983, for a gperf-like utility, and my first glob code (the lead-up to my first regex code) came at around the same time. I do remember my peers warning me that these algorithms would be too slow and that I should stick with the algorithms already in use (plain string searches and linked lists, IIRC).

Perhaps the major lesson I took away from learning Unix as a second system (the first being a Purdue derivative of the CDC KRONOS operating system) was to always research the literature before setting out to code something new, as someone else has probably already encountered and solved the problem you are now facing.

So what constitutes the literature these days? For a long time, the literature consisted of academic papers in paper journals that you had to search through manually. By the early 1980s, many college and university libraries had electronic indexes, but you still had to retrieve and copy the articles by hand, as only abstracts were available online.

John Lions' books of the annotated Unix Version 6 kernel were among the first widely available copies of annotated operating system source code. Via a samizdat-like process, many aspiring software engineers and programmers learned about operating systems (and programming-in-the-large) from those books, even though the books were supposed to be restricted to Unix Version 6 licensees.

Started in 1984, the GNU Project not only gave the world Unix-like command-line tools, it also provided a wealth of source code to peruse, ponder, and possibly borrow from (subject to GPL restrictions). GNU's free software gave impetus to what became the less restrictively licensed Open Source movement (I leave the ethics of Open Source vs. Free Software to the reader).

The invention of the World Wide Web, along with the commercialization of the Internet, has made it possible for anyone to learn from some of the best software engineers, programmers, and computer scientists in the world from the comfort of their own bedroom. When I took computer science and programming courses (right after the dinosaurs had died off), you learned about good code by trying to write it, but you rarely read over other people's code. This is in contrast to learning to write in a natural language, where you read and study the great literature of that language before anyone expects you to write even mediocre prose (or poetry). My brief perusal of current university course catalogs suggests the same is true nowadays (learning only by writing code), although there is so much to learn just in the basics of programming that giving short shrift to programming-in-the-large issues is understandable -- and much of what you learn from reading code deals with the issues of programming large systems.

With this wealth of information to learn from, why don't more developers make use of this information? It is true that many academic papers (now available in full online) cover subject matter that is applicable in only narrow circumstances. But academic papers today are only part of what can be called the literature — there is so much source code available in programs and libraries that you hardly know where to start.

IMHO, the reasons developers don't make use of this information break down into three categories:

  • "It won't work in the real world." (of academic interest only)
  • "I can do it better." (misapplied hubris)
  • "It isn't what I want." (misapplied hubris as well as lack of laziness)

(You might want to pause at this point to read or re-read about the three principal virtues of programmers -- laziness, hubris, and impatience.) I'll try to deal with these one at a time:

  1. "It won't work in the real world." The short answer to this is, "You won't know whether this technique works until you profile it." There are known suboptimal algorithms (bubblesort and bogo-sort come to mind), but generally developers and academics are trying to improve algorithms and systems, so looking to see what others have done in solving a particular problem is worthwhile. This is an example of false impatience — "I'm in too much of a hurry to look at algorithms that won't work anyway." (I have some experience with this, as I worked on PC databases and PC-based electronic CAD before anyone thought you could do those on PCs.)
  2. "I can do it better." This is an example of misapplied hubris, or of letting hubris get in the way of impatience. Of course you can do it better — but wouldn't it be better to add your special modifications to existing code thereby making it perfect :) rather than writing all of it from scratch? (See CPAN below.)
  3. "It isn't what I want." This is another example of letting hubris get in the way of impatience, with the same results as (2). That being said, I know that library design is hard work, while we live in an age where there is barely time to code it once for the current project. But the hardest part (IMHO) is figuring out when and where to factor out the code that will need to be used or written again. (Again, see CPAN below.) It does pay off, as I have 18-year-old regex matching code still in use today.

The other key point of searching the literature (IMHO) is that you may (re)discover whole areas of software engineering and computer science that are of use in current or future projects. The portability and usefulness of Perl depend in part on the use of YACC *and* on shipping the YACCed Perl source, so that people can compile and use Perl without themselves possessing YACC-like tools. I have seen a number of examples over the years where a programming language is included in a product but the language is handicapped because the developer chose not to use YACC/LEX-style tools. As another example, the next time you write a program where part of the functionality is classifying things, you could examine the multitude of expert system and description logic toolsets, as that is exactly what they are designed for.
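
As a small illustration of letting a grammar tool do the work rather than hand-rolling a parser, here is a minimal sketch using Parse::RecDescent from CPAN -- my choice for the example; the point above is about YACC/LEX-style tools in general -- that parses and evaluates simple sums of integers:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parse::RecDescent;

    # A tiny right-recursive grammar: each action returns the value
    # computed for that rule, so parsing and evaluation happen together.
    my $grammar = q{
        expression: term '+' expression  { $item[1] + $item[3] }
                  | term                 { $item[1] }
        term:       /\d+/                { $item[1] }
    };

    my $parser = Parse::RecDescent->new($grammar)
        or die "Bad grammar\n";

    my $result = $parser->expression("1+2+3");
    print defined $result ? "Result: $result\n" : "Parse failed\n";

The generated parser exposes one method per grammar rule, which is the sort of clean, well-defined interface that makes these tools worth reaching for instead of ad hoc string hacking.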

This all leads to CPAN, especially to the CPAN modules, where you can (probably) find every algorithm in common use that is useful in Perl. CPAN makes it possible to easily assemble systems with high degrees of functionality and complexity. For example, I was able to write, by myself, a complete knowledge management system (the TCE Corporate Technical Memory) in around 6,400 lines of Perl using DBI (backend metadata) and Text::MetaText (precursor to the Template Toolkit, for frontend HTML template processing), with the core functionality in a 1,000-line Perl module (the rest was CGI programs and maintenance utilities).
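
To give a feel for the kind of glue CPAN makes possible, here is a minimal sketch in the same spirit as that system -- not the actual TCE code, and with an invented table and template -- using DBI with SQLite for the metadata (assuming DBD::SQLite is installed) and the Template Toolkit for the HTML front end:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;
    use Template;

    # Backend metadata: a single table of documents in a SQLite file.
    my $dbh = DBI->connect("dbi:SQLite:dbname=memory.db", "", "",
                           { RaiseError => 1 });
    $dbh->do("CREATE TABLE IF NOT EXISTS documents (title TEXT, author TEXT)");
    $dbh->do("INSERT INTO documents VALUES (?, ?)", undef,
             "Outstanding Objects", "Mark Leighton Fisher");

    # Each row comes back as a hashref, ready to hand to the template.
    my $docs = $dbh->selectall_arrayref(
        "SELECT title, author FROM documents ORDER BY title",
        { Slice => {} },
    );

    # Frontend HTML: render the document list through a template.
    my $template = q{
    <ul>
    [% FOREACH doc IN docs %]
      <li>[% doc.title %] ([% doc.author %])</li>
    [% END %]
    </ul>
    };

    my $tt = Template->new();
    $tt->process(\$template, { docs => $docs }, \my $html)
        or die $tt->error();
    print $html;

Almost none of that is "my" code in any deep sense -- the storage, the SQL handling, and the templating all come from CPAN, which is exactly why the core of the real system fit in a single modest module.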

As we developers gain the knowledge to group functionality together into larger objects, it will become possible to build larger and more complex systems with minimal effort. A good example of this is the current ability to embed full programming languages into other programs, as can be done with Perl, Tcl/Tk, and LISP, among other languages. Programming languages -- large, complex entities -- can be embedded because they are now well enough understood to have well-defined interfaces. Other functionalities that can now be embedded but were unthinkable as recently as 10 years ago are regular expressions (via Perl-Compatible Regular Expressions) and text template systems (the aforementioned Template Toolkit, among others). If the OpenCYC project bears fruit, in a few years developers will be able to embed a full natural language interface with reasoning abilities into their systems.

Although sometimes the wheel needs reinventing (modern vehicles would be much less comfortable to drive without modern suspension systems), it needs reinvention less often than expected by those who do not take the time to research the literature first. And if developers leave themselves open to serendipity, they may discover whole areas of software engineering and computer science that can help them accomplish their objectives — better/faster/cheaper programs, libraries, and systems.



P.S. For those not conversant with 1980's rock music, the title of this entry is a play on AC/DC's song "Dirty Deeds (Done Dirt Cheap)" :).


citeseer

rozim on 2003-06-07T00:00:16

A useful site for "prior art", especially in terms of techie computer science papers, is CiteSeer.

Re:citeseer

inkdroid on 2003-06-10T21:43:10

Networked Computer Science Technical Reference Library is a decent resource too.